Abstract

For bounded datasets such as the TREC Web Track (WT10g), the computation of term frequency
(TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire
web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets
provide values for term count (TC), meaning the number of times a certain term occurs in the entire
corpus. Intuitively, this value is different from document frequency (DF), the number of documents (e.g.,
web pages) a certain term occurs in. We conduct a comparison study between TC and DF values within
the Web as Corpus (WaC). We found a very strong correlation with Spearman's ρ > 0.8 (p < 0.005),
which makes us confident in claiming that for such recently created corpora the TC and DF values can
be used interchangeably to compute IDF values. These results are useful for the generation of accurate
lexical signatures based on the TF-IDF scheme.
1 Introduction and Motivation

In information retrieval (IR) research, the term frequency (TF) - inverse document frequency (IDF) concept
is well known and established for extracting the most significant terms from textual content while
dismissing the more common ones. It has also been used to generate lexical signatures (LSs) of web
pages [H [51 [31 [H [S]. Such LSs can be used to (re-)discover missing web pages when fed back into
search engine interfaces. The computation of TF values for a web page is straightforward since we can
simply count the occurrences of each term within the page. Two values are mandatory for the IDF
computation: the overall number of documents in the corpus and the number of documents a term appears
in. We call the second value document frequency (DF). Since both values are unknown when the entire web
is the corpus, accurate IDF computation for web pages is impossible and values need to be estimated.
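To make the roles of the two quantities concrete, the following is a minimal sketch of how a lexical signature could be derived once DF estimates are available. The logarithmic IDF variant, the function names, and the toy DF numbers are all assumptions for illustration; the text above does not fix a particular formula.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    """One common IDF variant (an assumption here): log(N / DF)."""
    return math.log(num_docs / doc_freq)


def lexical_signature(page_terms, doc_freqs, num_docs, k=5):
    """Return the k terms of a page with the highest TF-IDF score."""
    tf = Counter(page_terms)  # TF: raw count of each term within the page
    scores = {
        term: count * idf(num_docs, doc_freqs[term])
        for term, count in tf.items()
        if doc_freqs.get(term, 0) > 0  # skip terms without a DF estimate
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Made-up DF estimates for a hypothetical corpus of 1,000 documents.
doc_freqs = {"the": 990, "of": 950, "lexical": 12, "signature": 20, "page": 300}
page = ["the", "lexical", "signature", "of", "the", "page", "lexical"]
signature = lexical_signature(page, doc_freqs, num_docs=1000, k=3)
```

Common terms such as "the" receive an IDF close to zero and drop out of the signature, which is the behavior the TF-IDF scheme is meant to provide.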

Various corpora containing web pages, their textual content and their in- and outlinks are available and
can be used to estimate IDF values since they are considered a representative sample of the Internet [6].
The TREC Web Track is probably the most common corpus and has, for example, been used in [5] for
IDF estimation. The British National Corpus (BNC) [7], as another example, has been used in [S]. Google
published the N-grams [Sj in 2006 and hence provides a powerful alternative source for TC values of terms
extracted from web pages from the Google index. The Web as Corpus Kool Yinitiative (WaCky) provides the
WaC corpus as an alternative at no charge for researchers. The problem with these corpora is that they
do not provide DF values for the terms (or n-term tokens) they contain. If the corpus documents are
provided along with the terms, we can count the total number of documents and determine the DF values
ourselves. Table 1 gives an overview of selected corpora and their characteristics. The first row indicates
what kind of documents the corpus is based upon. The row Number of Documents shows the total number
of documents the corpus consists of (or, in the case of the Google N-grams, the number of documents the
corpus was generated from). This row also gives information about whether the documents of the corpus are
available. As mentioned above, recognizing the document boundaries within the corpus becomes necessary
when computing IDF values.

A limited number of free copies of the corpus are available from the Linguistic Data Consortium, University of Pennsylvania.

Table 1: Available Text Corpora Characteristics

Term  All  Buy  Can't  Is  Love  Me  Need  Please  You  My  Loving  Long
TC      2    1      1   1     2   2     1       2    1   1       1     3
DF      2    1      1   1     2   2     1       1    1   1       1     1

Table 2: TC-DF Comparison Example

The row TC indicates whether TC values of the corpus are available. The following simple example
illustrates the difference between TC and DF. Let us consider a corpus of 5 documents D = {d1, ..., d5} where
each document contains the title of a song by The Beatles:

d1 = Please Please Me
d2 = Can't Buy Me Love
d3 = All You Need Is Love
d4 = All My Loving
d5 = Long, Long, Long

Table 2 shows the TC and DF values of all terms occurring in our small sample corpus. We can see that the
values are identical for the majority of the terms (10 out of 12); only Please and Long, which are repeated
within a single title, have a TC greater than their DF. The example also shows that term processing
such as stemming would have an impact on these numbers since Love and Loving are treated as different
terms here.
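The counts for this toy corpus can be reproduced with a few lines of Python. This is only a sketch: the whitespace tokenizer that strips commas and periods is an assumption, and no stemming or case folding is applied, so Love and Loving stay separate terms.

```python
from collections import Counter

docs = [
    "Please Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "All My Loving",
    "Long, Long, Long",
]


def tokenize(text):
    # Split on whitespace and strip commas/periods; no stemming.
    return [tok.strip(",.") for tok in text.split()]


tc = Counter()  # term count: every occurrence in the corpus
df = Counter()  # document frequency: at most one count per document
for doc in docs:
    tokens = tokenize(doc)
    tc.update(tokens)
    df.update(set(tokens))

# Terms whose TC and DF differ (those repeated within one document).
differing = sorted(t for t in tc if tc[t] != df[t])
```

TC and DF can only diverge for terms that occur more than once inside a single document, which is why the two values agree for most terms of this corpus.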

Since we are interested in computing accurate IDF values for web page content, it seems reasonable to
choose a corpus that is based on the textual content of web pages. Even though the TREC WT10g provides the
documents and the corpus size seems sufficiently large, it has been shown to be somewhat dated [11].

We are interested in using the Google N-gram dataset as a corpus to generate accurate IDF values from,
but unfortunately Google only provides TC values. The WaC corpus, in contrast, provides both TC and DF
values and therefore we can:

1. establish a relationship between TC and DF values within the WaC 

2. establish a relationship between WaC based TC and Google N-gram based TC 

3. and finally infer Google N-gram DF from points 1 and 2

This paper presents the preliminary results of the study. The results indicate that for sufficiently large
and recently generated corpora, DF values can be estimated from TC values.

2 Related Work 

2.1 Correlation between DF and TC Values 

Zhu et al. [12] used an Internet search engine to obtain estimates for DF values of n-grams. They used these
values to estimate TC values and compared those to TC values from a 103 million word Broadcast News
corpus which acted as their baseline. They found that the values are very similar and thus concluded that
values obtained from the web are usable to estimate TC. Keller et al. [13] also used Internet search engines
to obtain DF values for bigrams. Like Zhu et al., they show a high correlation between values obtained
from the web and values from a given (traditional) corpus (the BNC). The main application Keller et al.
suggest is for bigrams that are missing in a given corpus. Nakov et al. [14] show that the n-gram counts from
several Internet search engines differ, but not to a statistically significant degree. They also show that the
results from one search engine are stable over time, which is encouraging for researchers using this technique
to obtain TC values.

All these studies have two things in common: 1) they all show a strong correlation between DF and TC
values and 2) they use DF estimates from search engines and compare them to TC values from conventional
corpora. This is where our approach differs, since we use TC values from well-established text corpora
and show the correlation to measured DF values.

2.2 Generating IDF Values for Web Pages 

Sugiyama et al. [5] use the TREC-9 Web Track dataset [15] to estimate IDF values for web pages. The
novel part of their work was to also include the content of hyperlinked neighboring pages in the TF-IDF 
calculation of a centroid page. They show that augmenting the generation of TF-IDF values with content 
of in-linked pages increases the retrieval accuracy more than augmenting TF-IDF values with content from 
out-linked pages. They claim that this method represents the web page's content more accurately and hence 
improves the retrieval performance. 

Phelps and Wilensky [1] proposed using the TF-IDF model to generate LSs of web pages and introduced
"robust hyperlinks", a URL with an LS appended. Phelps and Wilensky conjectured that if a URL returned
an HTTP 404 error, the web browser could submit the appended LS to a search engine to find either
a copy of the page at a different URL or a page with content similar to that of the missing page. Phelps
and Wilensky did not publish details about how they determined IDF values but stated that the mandatory
figures can be taken from Internet search engines. That implies the assumption that the index of a search
engine is representative of all Internet resources. However, they do not publish the value they used for the
estimated total number of documents on the Internet.

3 Experiment Setup 

The WaC corpus provides what its authors call a frequency list, a list of all unique terms in the corpus
(lemmatized and non-lemmatized) and their TC values. Since the document boundaries in the corpus are
obvious, we can compute the DF values for all terms. Since we are interested in generating TF-IDF values
for web pages and feeding them back into search engines, we dismiss all lemmatized terms and only use the
non-lemmatized terms. We rank both lists in decreasing order of their TC and DF values and resolve ties
with the minimum rank. For example, consider four terms a, b, c, d where term a has the highest value,
terms b and c share the second highest value and term d has the lowest value. The ranking here would be
a=1, b=2, c=2, d=4. This kind of ranking is also known as sports ranking. We compute Spearman's ρ and
Kendall τ to quantify the correlation. Table 3 shows the top 20 terms from the WaC corpus ordered by
decreasing TC and DF values. The similarity between the two rankings already becomes visible in that
small example since the union of both top 20 rankings holds just 22 unique terms. It is not surprising
that the TC values are much greater than the DF values since for DF duplicates within one document are
not counted. Since Table 3 mainly holds stop words, we show terms ranked between 101 and 120 and their
TC and DF values in Table 4. The correlation is obviously less strong and the number of unique terms
across both rankings goes up to 33.

Table 3: Top 20 Terms and their TC and DF Values
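The ranking scheme and the rank correlation described in this section can be sketched as follows. This is an illustrative pure-Python version, not the code used in the study: the function names are hypothetical, and Spearman's ρ is computed here as the Pearson correlation of the minimum ("sports") ranks, whereas many statistics packages use average ranks for ties.

```python
def sports_rank(values):
    """Competition ranking: highest value gets rank 1, ties share the minimum rank."""
    order = sorted(values, key=values.get, reverse=True)
    ranks = {}
    for position, term in enumerate(order, start=1):
        prev = order[position - 2] if position > 1 else None
        if prev is not None and values[prev] == values[term]:
            ranks[term] = ranks[prev]  # tie: reuse the rank of the previous term
        else:
            ranks[term] = position     # ranks after a tie are skipped
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman_rho(tc, df):
    """Spearman's rho as the Pearson correlation of the TC and DF ranks."""
    terms = sorted(tc)
    r_tc, r_df = sports_rank(tc), sports_rank(df)
    return pearson([r_tc[t] for t in terms], [r_df[t] for t in terms])


# The four-term example from the text: b and c tie for second place.
ranks = sports_rank({"a": 9, "b": 5, "c": 5, "d": 1})
```

For the four-term example, `ranks` maps a to 1, b and c to 2, and d to 4, matching the ranking given above.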

4 Experiment Results 



4.1 Correlation within the WaC Corpus 

Figure 1 shows (in log-log scale) the correlation of ranked terms within the WaC corpus. The x-axis represents
the TC ranks of terms and the y-axis the corresponding DF rank of the same term. As expected, we see the
majority of the points within a diagonal corridor, which indicates a great similarity between the rankings.

Figures 2 and 3 show the measured and estimated correlation between TC and DF values in the WaC
dataset. The solid black line displays Spearman's ρ. The increasing size of the dataset is shown on the
x-axis. The value for ρ at any size of the dataset is above 0.8, which indicates a very strong correlation
between the rankings. The results are statistically significant with a p-value of 2.2 x 10^-16. The blue line
in both figures shows the computed Kendall τ values for the top 1,000,000 ranks and the dotted red line
represents the estimated values for the remaining set of data in the WaC corpus. Since the computed τ
values are hard to read on a normal scale (Figure 2), we plotted the same graph in semi-log scale in Figure
3. The computed τ values vary between 0.82 and 0.74 and the estimated values have a minimum of 0.66.

Table 4: Terms ranked between 101 and 120 and their TC and DF Values

Figure 1: Correlation between Term Count and Document Frequency in the WaC dataset

Figure 2: Measured and Estimated Correlation between Term Count and Document Frequency in the WaC
dataset - Normal Scale

Figure 3: Measured and Estimated Correlation between Term Count and Document Frequency in the WaC
dataset - Semi-Log Scale

Figure 4: Computation Time for Kendall Tau

We did not compute τ for greater ranks since it is a very time-consuming operation and the estimated
values also indicate a strong correlation. Gilpin [16] provides a table for converting τ into ρ values. We
use this data to estimate our τ values. Even though the data in [16] is based on τ values computed from a
dataset with a bivariate normal population (which we do not believe to have in the WaC dataset), it supports
our measured values. For example, it shows that a τ value of 0.8 can be converted to a ρ of 0.94, which is
consistent with our measured values shown in Figure 2. Therefore we can predict the high τ values even
beyond the top 1,000,000 ranks shown in Figure 3.
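A conversion of this kind can be approximated with the classical relations that hold for a bivariate normal population: τ = (2/π) arcsin(r) and Spearman's ρ = (6/π) arcsin(r/2), where r is the Pearson correlation. The sketch below chains the two; whether the table in [16] was computed in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def tau_to_pearson(tau):
    # Bivariate normal: tau = (2 / pi) * asin(r)  =>  r = sin(pi * tau / 2)
    return math.sin(math.pi * tau / 2)


def pearson_to_spearman(r):
    # Bivariate normal: rho_s = (6 / pi) * asin(r / 2)
    return (6 / math.pi) * math.asin(r / 2)


def tau_to_spearman(tau):
    """Convert Kendall's tau to Spearman's rho under bivariate normality."""
    return pearson_to_spearman(tau_to_pearson(tau))


rho_for_tau_08 = tau_to_spearman(0.8)
```

Under these assumptions a τ of 0.8 maps to a ρ of roughly 0.946, close to the 0.94 quoted above.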



4.2 Computation Time for Kendall τ

Figure 4 shows the measured and predicted computation time (y-axis, in seconds) for τ of top n rankings
(x-axis). The black solid line shows the measured time values for rankings up to the top 1,000,000 terms.
The red dashed line represents the predicted time values for the entire corpus and (in the small plot in
the top left corner) for the top 1,000,000 ranks. Figure 4 shows the observed complexity of O(n^2). For
the entire WaC dataset (over 11 million unique terms) we estimate a computation time for Kendall τ of
almost 11 million seconds or more than 126 days, which is clearly beyond a reasonable computation time
for a correlation value. Kendall τ was computed using an off-the-shelf correlation function that is part of
the R Project, an open source environment for statistical computing. The software (version 2.6) was run on a
Dell server with a Pentium 4 2.8 GHz CPU and 1 GB of memory.
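The quadratic behavior follows directly from the definition of Kendall τ, which compares every pair of items. The sketch below is a naive tau-a implementation together with a quadratic extrapolation helper; it is not the R routine used in the study (which may treat ties differently), and both function names are hypothetical.

```python
def kendall_tau_a(xs, ys):
    """Naive O(n^2) Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):  # every unordered pair is visited once
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # sign == 0 means a tie in xs or ys; tau-a counts it as neither
    return (concordant - discordant) / (n * (n - 1) / 2)


def extrapolate_quadratic(measured_n, measured_seconds, target_n):
    """Scale a measured runtime to target_n, assuming time grows with n^2."""
    return measured_seconds * (target_n / measured_n) ** 2
```

Because the number of pairs is n(n-1)/2, doubling the number of ranked terms roughly quadruples the running time, which is the kind of growth used to predict the time for the full corpus.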



http://www.r-project.org/



4.3 Term Count - Document Frequency Ratio in the WaC Corpus 

Another interesting way to show the correlation between TC and DF values is simply looking at the ratio
of the two values. Figure 5(a) shows the distribution of TC/DF ratios with values rounded to two
decimal places and Figure 5(b) shows the ratios rounded to one decimal place. It becomes obvious that the vast
majority of the ratio values are very small. The visual impression is supported by the computed mean value
of 1.23 with a standard deviation of σ = 1.21 for both Figure 5(a) and 5(b). The median of ratios is 1.00
and 1.0, respectively. Figure 5(c) shows the distribution of TC/DF ratios rounded to integer values. It is
consistent with the pattern of Figures 5(a) and 5(b) and the mean value is equally low at 1.23 (σ = 1.22).
The median here is also 1. Figure 5, together with the computed mean and median values, is
another solid indicator of the strong correlation between TC and DF values within the corpus.
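The ratio statistics can be computed along the following lines. This is a sketch using Python's standard library; the function name, the choice of the population standard deviation, and the toy counts are assumptions for illustration only.

```python
import statistics
from collections import Counter


def ratio_summary(tc, df, digits=2):
    """Summarize TC/DF ratios rounded to the given number of decimal places."""
    ratios = [round(tc[t] / df[t], digits) for t in tc if df.get(t, 0) > 0]
    return {
        "histogram": Counter(ratios),          # how often each rounded ratio occurs
        "mean": statistics.mean(ratios),
        "stdev": statistics.pstdev(ratios),    # population standard deviation
        "median": statistics.median(ratios),
    }


# Toy counts, not data from the WaC corpus.
tc_toy = {"All": 2, "Please": 2, "Long": 3, "Me": 2}
df_toy = {"All": 2, "Please": 1, "Long": 1, "Me": 2}
summary = ratio_summary(tc_toy, df_toy)
```

A ratio of exactly 1 means the term never repeats within a document, so a median of 1 indicates that this holds for at least half of all terms.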



4.4 Correlation between the WaC and the N-gram Corpus 

The TC values for both corpora, WaC and N-gram, are available and therefore we investigate their
correlation. Figure 6 displays (in log-log scale) the frequencies of unique TC values in both corpora. The graph
shows the TC threshold of 200 that Google applied while creating the N-gram dataset. By visual observation it
becomes obvious that the distribution of TC values in both corpora is very similar. Only the larger size of the
Google N-gram corpus is responsible for the offset between the graphs.
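The quantity plotted for each corpus, the number of terms sharing a given TC value, can be tallied as sketched below. The function name and the toy counts are hypothetical; the `min_tc` parameter only mimics a cutoff such as the threshold of 200 mentioned above.

```python
from collections import Counter


def tc_value_frequencies(term_counts, min_tc=1):
    """Map each distinct TC value to the number of terms that have it."""
    return Counter(tc for tc in term_counts.values() if tc >= min_tc)


# Toy term counts, not taken from either corpus.
toy_counts = {"a": 250, "b": 250, "c": 199, "d": 1000}
freqs = tc_value_frequencies(toy_counts, min_tc=200)
```

Plotting the resulting (TC value, frequency) pairs on log-log axes for two corpora gives curves that can be compared visually, as done in Figure 6.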

5 Conclusion 

We have shown a very strong correlation between the TC and DF values within the WaC corpus with
Spearman's ρ > 0.8 (p < 2.2 x 10^-16). This result leads us to the conclusion that the two values can be used
interchangeably and therefore TC values are usable for the generation of accurate IDF values. We also show
(by visual observation) a high correlation between the TC values of the WaC and of the N-gram datasets.
We can now claim that, despite the fact that the Google N-gram dataset does not contain DF values, the
corpus and its TC values are also usable for accurate IDF computation, which can lead to the generation of
LSs of web pages.

Figure 5: Frequency of TC/DF Ratios in the WaC Corpus

Figure 6: Term Count Frequencies in the WaC and N-gram Corpus